@heemankv heemankv commented Nov 20, 2025

PR Type

  • Bugfix
  • Feature
  • Code style update
  • Refactoring
  • Build-related changes
  • Documentation content changes
  • Testing
  • Other

What is the current behavior?

Critical issues preventing production deployment:

  1. L1 Settlement Client: Madara fails to start if L1 endpoint is down at startup, and crashes when L1 becomes unavailable during runtime
  2. Gateway Client: Madara stops syncing when gateway calls fail, causing data loss and service interruption

What is the new behavior?

L1 Resilience

  • Lazy initialization: Madara starts successfully even when L1 is down
  • Infinite retry: All L1 operations retry indefinitely with phase-based backoff (Aggressive → Backoff → Steady)
  • Event stream resilience: Event streams automatically recreate on failure
  • Health monitoring: Real-time health tracking with automatic recovery detection
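Lazy initialization can be pictured as a client whose constructor never touches the network and only connects on first use. This is a minimal sketch of that idea, not the code in this PR: `LazyClient`, `Connection`, and the `connect` callback are hypothetical names introduced for illustration.

```rust
/// Hypothetical stand-in for a connected L1 provider (not a type from this PR).
#[derive(Debug, Clone, PartialEq)]
pub struct Connection {
    pub endpoint: String,
}

/// Defers connecting until the first operation that needs a connection.
pub struct LazyClient<F>
where
    F: FnMut(&str) -> Result<Connection, String>,
{
    endpoint: String,
    connect: F,
    conn: Option<Connection>,
}

impl<F> LazyClient<F>
where
    F: FnMut(&str) -> Result<Connection, String>,
{
    /// Construction performs no I/O, so it cannot fail when L1 is down.
    pub fn new(endpoint: &str, connect: F) -> Self {
        Self {
            endpoint: endpoint.to_string(),
            connect,
            conn: None,
        }
    }

    /// Returns the cached connection, or makes one connection attempt.
    /// A failed attempt leaves the client usable; the caller may retry later.
    pub fn connection(&mut self) -> Result<&Connection, String> {
        if self.conn.is_none() {
            let conn = (self.connect)(&self.endpoint)?;
            self.conn = Some(conn);
        }
        Ok(self.conn.as_ref().expect("connection was just set"))
    }

    /// True once a connection attempt has succeeded.
    pub fn is_connected(&self) -> bool {
        self.conn.is_some()
    }
}
```

In this sketch a caller would wrap `connection()` in whatever retry loop it already uses, so an outage at startup delays the first L1 operation instead of aborting the process.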

Gateway Resilience

  • Infinite retry on reads: All GET operations retry indefinitely until gateway recovers
  • Health monitoring: Tracks gateway availability with automatic recovery
  • Fast-fail on writes: Transaction submissions fail quickly by design, because retrying a submission could create duplicate transactions
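The split between reads and writes can be sketched as a single helper that retries idempotent calls until they succeed and returns the first error for everything else. This is an illustrative sketch under assumptions: `Operation`, `execute`, and the `wait_before_retry` hook are hypothetical names, and a real implementation would sleep (and check for cancellation) inside that hook.

```rust
/// Hypothetical classification of gateway calls (names are illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Operation {
    /// Idempotent GET-style call: safe to repeat.
    Read,
    /// Transaction submission: repeating it risks a duplicate.
    Write,
}

/// Runs `call` under the policy for `op`.
/// `wait_before_retry` receives the number of the attempt that just failed;
/// production code would sleep there, tests can simply record the value.
pub fn execute<T, E>(
    op: Operation,
    mut call: impl FnMut() -> Result<T, E>,
    mut wait_before_retry: impl FnMut(u32),
) -> Result<T, E> {
    let mut attempt: u32 = 0;
    loop {
        attempt = attempt.saturating_add(1);
        match call() {
            Ok(value) => return Ok(value),
            // Writes surface the first error instead of resubmitting.
            Err(err) if op == Operation::Write => return Err(err),
            // Reads keep retrying until the gateway answers.
            Err(_) => wait_before_retry(attempt),
        }
    }
}
```

Keeping the policy in one place makes it hard to wrap a write in the infinite-retry path by accident, which is the failure mode the fast-fail rule is meant to prevent.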

Shared Infrastructure

  • ✅ New mp-resilience crate with reusable retry/health primitives
  • ✅ Phase-based retry: fixed 2s intervals (Aggressive) → exponential backoff (Backoff) → fixed 60s intervals (Steady)
  • ✅ Clean recovery transitions in <2s
  • ✅ Prevents rapid state oscillations on flaky connections
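One way to read the 2s → exponential → 60s schedule is as a pure function from attempt number to delay. The sketch below is an assumption about the shape of that schedule, not the `mp-resilience` implementation: the function name, the number of aggressive attempts, and the doubling factor are all illustrative, and only the 2s and 60s endpoints come from the description above.

```rust
use std::time::Duration;

/// Retry phases named in the description above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Aggressive,
    Backoff,
    Steady,
}

/// Assumption: how many attempts use the short fixed interval.
const AGGRESSIVE_ATTEMPTS: u32 = 5;
/// Interval used during the aggressive phase (2s, from the description).
const BASE: Duration = Duration::from_secs(2);
/// Interval used once the schedule settles (60s, from the description).
const CAP: Duration = Duration::from_secs(60);

/// Maps a zero-based attempt number to its phase and the delay before the next try.
pub fn delay_for_attempt(attempt: u32) -> (Phase, Duration) {
    if attempt < AGGRESSIVE_ATTEMPTS {
        return (Phase::Aggressive, BASE);
    }
    // Number of doublings applied since leaving the aggressive phase.
    let doublings = attempt - AGGRESSIVE_ATTEMPTS + 1;
    // checked_shl and saturating_mul keep very large attempt counts from overflowing.
    let factor = 1u64.checked_shl(doublings).unwrap_or(u64::MAX);
    let secs = BASE.as_secs().saturating_mul(factor);
    if secs >= CAP.as_secs() {
        (Phase::Steady, CAP)
    } else {
        (Phase::Backoff, Duration::from_secs(secs))
    }
}
```

With these assumed constants, attempts 0 to 4 wait 2s, attempts 5 to 8 wait 4s, 8s, 16s, and 32s, and every later attempt waits 60s.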

Does this introduce a breaking change?

No - All changes are backward compatible. The health monitor now returns a JoinHandle for graceful shutdown, but callers can ignore it.
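The compatibility claim rests on the fact that a returned handle can be dropped without stopping the task it refers to. The sketch below shows this with `std::thread` and an `AtomicBool` stop flag; it is an assumption-laden illustration, since the handle in this PR is presumably an async runtime's `JoinHandle`, and `start_health_monitor` with these parameters is a hypothetical signature.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical monitor loop: counts ticks until `stop` is set.
/// Returning the handle is additive; callers that drop it behave as before.
pub fn start_health_monitor(stop: Arc<AtomicBool>, ticks: Arc<AtomicU64>) -> JoinHandle<()> {
    thread::spawn(move || {
        while !stop.load(Ordering::Relaxed) {
            // A real monitor would probe the endpoint here.
            ticks.fetch_add(1, Ordering::Relaxed);
            thread::sleep(Duration::from_millis(5));
        }
    })
}

/// Graceful shutdown for callers that kept the handle:
/// signal the loop, then wait for it to exit.
pub fn shutdown(stop: &AtomicBool, handle: JoinHandle<()>) {
    stop.store(true, Ordering::Relaxed);
    handle.join().expect("monitor thread panicked");
}
```

A caller that wants a clean shutdown keeps the handle and calls `shutdown`; a caller written before this change can discard the return value and the monitor keeps running in the background.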


Other Information

Production-Ready Fixes Applied

  1. Fixed unbounded memory growth in health tracker
  2. Fixed a SystemTime panic triggered by clock adjustments (e.g. NTP)
  3. Added cancellation checks for fast shutdown
  4. Separate retry contexts for stream creation vs. event processing
  5. Improved error logging and structured logging throughout
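The first two fixes follow well-known patterns, sketched here under assumptions rather than copied from the PR. `SystemTime::duration_since` returns an `Err` when the clock has moved backwards, so unwrapping it can panic; and a history that is only ever appended to grows without bound. `elapsed_since` and `BoundedHistory` are hypothetical names for illustration.

```rust
use std::collections::VecDeque;
use std::time::{Duration, SystemTime};

/// Elapsed wall-clock time that tolerates the clock stepping backwards.
/// Falls back to zero instead of panicking on the `Err` case.
pub fn elapsed_since(earlier: SystemTime, now: SystemTime) -> Duration {
    now.duration_since(earlier).unwrap_or(Duration::ZERO)
}

/// Fixed-capacity record of recent call outcomes (true = success).
pub struct BoundedHistory {
    capacity: usize,
    outcomes: VecDeque<bool>,
}

impl BoundedHistory {
    /// Capacity is clamped to at least 1 so `record` always stores something.
    pub fn new(capacity: usize) -> Self {
        let capacity = capacity.max(1);
        Self {
            capacity,
            outcomes: VecDeque::with_capacity(capacity),
        }
    }

    /// Adds an outcome, evicting the oldest one once the buffer is full.
    pub fn record(&mut self, success: bool) {
        if self.outcomes.len() == self.capacity {
            self.outcomes.pop_front();
        }
        self.outcomes.push_back(success);
    }

    /// Number of outcomes currently retained (never exceeds the capacity).
    pub fn len(&self) -> usize {
        self.outcomes.len()
    }

    /// Number of failures among the retained outcomes.
    pub fn failure_count(&self) -> usize {
        self.outcomes.iter().filter(|ok| !**ok).count()
    }
}
```

Where only durations matter, `std::time::Instant` is monotonic and avoids the backwards-clock case entirely; the fallback above is for code that has to keep wall-clock timestamps.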

Testing

  • ✅ All unit tests pass (8/8 in mp-resilience)
  • ✅ Full project builds successfully
  • ✅ Manual testing: Madara survives L1/Gateway outages and auto-recovers

Files Changed

  • madara/crates/primitives/resilience/ (new crate)
  • madara/crates/client/settlement_client/src/eth/mod.rs
  • madara/crates/client/settlement_client/src/messaging.rs
  • madara/crates/client/settlement_client/src/gas_price.rs
  • madara/crates/client/gateway/client/src/methods.rs
  • madara/crates/client/gateway/client/src/retry.rs

Resolves: Production stability issues with L1 and Gateway connectivity

Log file for the Gateway test:
test_gateway_final.log

Log file for the L1 test:
test_final_l1.log

@Mohiiit Mohiiit added madara bug Report an issue or unexpected behavior labels Nov 21, 2025
@heemankv heemankv marked this pull request as ready for review November 23, 2025 12:41
@heemankv heemankv marked this pull request as draft November 23, 2025 17:04